Add a script to plot multi-run experiment results #122
Description
This PR adds a standalone utility for comparing Trinity's RFT experiments, which require multiple runs because of their stochastic nature. The script parses TensorBoard logs from repeated experiments, aggregates the results, and plots them with confidence intervals.

Here is a sample plot generated by the script, showing evaluation performance on the MATH500 benchmark for Qwen2.5-1.5B models trained with GRPO on the GSM8K and MATH datasets, respectively:
Example Usage:
In the current version, this script functions as a standalone utility. Users must manually specify the paths and configurations for each experiment in a YAML file. To generate the plots, run the following command:
[TODO] Automate the process of running repeated experiments and generating comparison plots.
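For reviewers who want a concrete picture of the aggregation step described above, here is a minimal sketch of how per-step scalars from repeated runs could be reduced to a mean with a confidence band. It is not the code in this PR: the function name `aggregate_runs`, the input layout (one `step -> value` dict per run), and the normal-approximation 95% interval (`z = 1.96`) are all assumptions made for illustration.

```python
import math
import statistics


def aggregate_runs(runs, z=1.96):
    """Reduce repeated runs to (step, mean, lower, upper) tuples.

    `runs` is a hypothetical layout: one dict per run mapping a training
    step to the scalar logged at that step. `z` is the normal-approximation
    multiplier (1.96 for a roughly 95% interval); both are assumptions.
    """
    # Keep only steps logged by every run so each mean uses the same sample size.
    common_steps = sorted(set.intersection(*(set(run) for run in runs)))
    summary = []
    for step in common_steps:
        values = [run[step] for run in runs]
        mean = statistics.fmean(values)
        # Standard error of the mean; zero when only a single run is available.
        if len(values) > 1:
            sem = statistics.stdev(values) / math.sqrt(len(values))
        else:
            sem = 0.0
        summary.append((step, mean, mean - z * sem, mean + z * sem))
    return summary


# Three made-up runs; only step 10 is present in all of them.
example_runs = [
    {0: 0.2, 10: 0.4},
    {0: 0.4, 10: 0.6},
    {10: 0.5, 20: 0.9},
]
example_summary = aggregate_runs(example_runs)
```

The lower and upper bounds can then be drawn as a shaded band around the mean curve, for example with matplotlib's `fill_between`. With only a handful of runs, a Student's t multiplier would give a wider and more honest interval than the fixed `z` used here.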
Checklist
Please check the following items before the code is ready to be reviewed.